proposed model
Lightweight Multimodal Artificial Intelligence Framework for Maritime Multi-Scene Recognition
Xi, Xinyu, Yang, Hua, Zhang, Shentai, Liu, Yijie, Sun, Sijin, Fu, Xiuju
Maritime Multi-Scene Recognition is crucial for enhancing the capabilities of intelligent marine robotics, particularly in applications such as marine conservation, environmental monitoring, and disaster response. However, this task presents significant challenges due to environmental interference, where marine conditions degrade image quality, and the complexity of maritime scenes, which requires deeper reasoning for accurate recognition. Pure vision models alone are insufficient to address these issues. To overcome these limitations, we propose a novel multimodal Artificial Intelligence (AI) framework that integrates image data, textual descriptions, and classification vectors generated by a Multimodal Large Language Model (MLLM) to provide richer semantic understanding and improve recognition accuracy. Our framework employs an efficient multimodal fusion mechanism to further enhance model robustness and adaptability in complex maritime environments. Experimental results show that our model achieves 98% accuracy, surpassing previous SOTA models by 3.5%. To optimize deployment on resource-constrained platforms, we adopt activation-aware weight quantization (AWQ) as a lightweight technique, reducing the model size to 68.75 MB with only a 0.5% accuracy drop while significantly lowering computational overhead. This work provides a high-performance solution for real-time maritime scene recognition, enabling Autonomous Surface Vehicles (ASVs) to support environmental monitoring and disaster response in resource-limited settings.
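The abstract names activation-aware weight quantization but gives no details, so the following is only a simplified sketch of the general idea: weights on input channels with large activations are scaled up before rounding, and the scale is folded back afterwards. The function names, the 4-bit width, the per-row symmetric rounding, and the `alpha` exponent are all assumptions made for illustration, not the authors' configuration.

```python
def quantize_row(row, bits=4):
    """Symmetric uniform quantization of one weight row.

    Returns the dequantized floats so the rounding error can be inspected.
    """
    qmax = 2 ** (bits - 1) - 1              # 7 for signed 4-bit
    max_abs = max(abs(w) for w in row)
    if max_abs == 0.0:
        return list(row)
    step = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / step))) * step for w in row]


def awq_style_quantize(weights, act_scale, alpha=0.5, bits=4):
    """Quantize `weights` (rows = output channels, columns = input channels).

    act_scale[j] is the mean absolute activation seen on input channel j
    during calibration (assumed positive). Each column is multiplied by
    act_scale[j] ** alpha before rounding and divided by it afterwards,
    which is equivalent to rescaling the activations at inference time.
    """
    scales = [a ** alpha for a in act_scale]
    result = []
    for row in weights:
        scaled = [w * s for w, s in zip(row, scales)]
        dequantized = quantize_row(scaled, bits)
        result.append([d / s for d, s in zip(dequantized, scales)])
    return result


# Hypothetical 2x3 weight matrix; channel 0 has much larger activations.
W = [[0.7, -0.3, 0.0], [0.02, 0.5, -0.41]]
activation_magnitude = [8.0, 1.0, 0.5]
W_q = awq_style_quantize(W, activation_magnitude, alpha=0.5)
```

With `alpha=0.0` every scale is 1 and the function reduces to plain per-row rounding, which makes it easy to compare the two schemes on the same weights.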
A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction
This paper presents a new voice conversion model capable of transforming both speaking and singing voices. It addresses key challenges in current systems, such as conveying emotions, managing pronunciation and accent changes, and reproducing non-verbal sounds. One of the model's standout features is its ability to perform accent conversion on hybrid voice samples that encompass both speech and singing, allowing it to change the speaker's accent while preserving the original content and prosody. The proposed model uses an encoder-decoder architecture: the encoder is based on HuBERT to process the speech's acoustic and linguistic content, while the HiFi-GAN decoder generates audio that matches the target speaker's voice. The model incorporates fundamental frequency (f0) features and singer embeddings to enhance performance while ensuring that pitch and tone accuracy and vocal identity are preserved during transformation. This approach improves how naturally and flexibly voice style can be transformed, showing strong potential for applications in voice dubbing, content creation, and technologies like Text-to-Speech (TTS) and Interactive Voice Response (IVR) systems.
- Oceania > Australia (0.04)
- North America > United States (0.04)
- North America > Jamaica (0.04)
Generalized Diffusion Model with Adjusted Offset Noise
One of the primary objectives of statistical machine learning is to model data distributions, a task that has supported recent advancements in generative artificial intelligence. The goal is to estimate a model that approximates an unknown distribution on the basis of multiple samples drawn from it. For example, when the data consists of images, the estimated model can be used to generate synthetic images that follow the same distribution. Diffusion models [28, 11, 29, 14] have emerged as powerful tools for estimating probability distributions and generating new data samples. They have been shown to outperform other generative models, such as generative adversarial networks (GANs) [6], particularly in image generation tasks [5]. Due to their flexibility and effectiveness, diffusion models are now employed in a wide range of applications, including drug design [3, 8], audio synthesis [17], and text generation [1, 18]. A well-known challenge faced by diffusion models for image generation is their difficulty in producing images with extremely low or high brightness across the entire image [9, 19, 12]. For example, it has been reported that Stable Diffusion [26], a popular diffusion model for text-conditional image generation, struggles to generate fully black or fully white images when given prompts such as "Solid black image" or "A white background" [19].
Blind Spatial Impulse Response Generation from Separate Room- and Scene-Specific Information
Lluís, Francesc, Meyer-Kahlen, Nils
For audio in augmented reality (AR), knowledge of the users' real acoustic environment is crucial for rendering virtual sounds that seamlessly blend into the environment. As acoustic measurements are usually not feasible in practical AR applications, information about the room needs to be inferred from available sound sources. Then, additional sound sources can be rendered with the same room acoustic qualities. Crucially, these are placed at different positions than the sources available for estimation. Here, we propose to use an encoder network trained using a contrastive loss that maps input sounds to a low-dimensional feature space representing only room-specific information. Then, a diffusion-based spatial room impulse response generator is trained to take the latent space and generate a new response, given a new source-receiver position. We show how both room- and position-specific parameters are considered in the final output.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States > Rhode Island (0.04)
- Europe > United Kingdom > England > West Yorkshire > Huddersfield (0.04)
Developing a Resource-Constraint EdgeAI model for Surface Defect Detection
Mih, Atah Nuh, Cao, Hung, Kawnine, Asfia, Wachowicz, Monica
Resource constraints have restricted several EdgeAI applications to machine learning inference approaches, where models are trained on the cloud and deployed to the edge device. This poses challenges such as bandwidth, latency, and privacy associated with storing data off-site for model building. Training on the edge device can overcome these challenges by eliminating the need to transfer data to another device for storage and model development. On-device training also provides robustness to data variations as models can be retrained on newly acquired data to improve performance. We, therefore, propose a lightweight EdgeAI architecture modified from Xception, for on-device training in a resource-constrained edge environment. We evaluate our model on a PCB defect detection task and compare its performance against existing lightweight models: MobileNetV2, EfficientNetV2B0, and MobileViT-XXS. The results of our experiment show that our model achieves a test accuracy of 73.45% without pre-training. This is comparable to the test accuracy of non-pre-trained MobileViT-XXS (75.40%) and much better than other non-pre-trained models (MobileNetV2: 50.05%, EfficientNetV2B0: 54.30%). The test accuracy of our model without pre-training is comparable to that of the pre-trained MobileNetV2 model (75.45%) and better than that of the pre-trained EfficientNetV2B0 model (58.10%). In terms of memory efficiency, our model performs better than EfficientNetV2B0 and MobileViT-XXS. We find that the resource efficiency of machine learning models does not solely depend on the number of parameters but also depends on architectural considerations. Our method can be applied to other resource-constrained applications while maintaining significant performance.
- Oceania > Australia (0.04)
- North America > Canada > New Brunswick > Fredericton (0.04)
Revolutionizing Healthcare Image Analysis in Pandemic-Based Fog-Cloud Computing Architectures
Elsayed, Al Zahraa, Mohamed, Khalil, Harb, Hany
The emergence of pandemics has significantly emphasized the need for effective solutions in healthcare data analysis. One particular challenge in this domain is the manual examination of medical images, such as X-rays and CT scans. This process is time-consuming and involves the logistical complexities of transferring these images to centralized cloud computing servers. Additionally, the speed and accuracy of image analysis are vital for efficient healthcare image management. This research paper introduces an innovative healthcare architecture that tackles the challenges of analysis efficiency and accuracy by harnessing the capabilities of Artificial Intelligence (AI). Specifically, the proposed architecture utilizes fog computing and presents a modified Convolutional Neural Network (CNN) designed specifically for image analysis. Different architectures of CNN layers are thoroughly explored and evaluated to optimize overall performance. To demonstrate the effectiveness of the proposed approach, a dataset of X-ray images is utilized for analysis and evaluation. Comparative assessments are conducted against recent models such as VGG16, VGG19, MobileNet, and related research papers. Notably, the proposed approach achieves an exceptional accuracy rate of 99.88% in classifying normal cases, accompanied by a validation rate of 96.5%, precision and recall rates of 100%, and an F1 score of 100%. These results highlight the immense potential of fog computing and modified CNNs in revolutionizing healthcare image analysis and diagnosis, not only during pandemics but also in the future. By leveraging these technologies, healthcare professionals can enhance the efficiency and accuracy of medical image analysis, leading to improved patient care and outcomes.
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > Pakistan (0.04)
- Asia > Middle East > Qatar (0.04)
- Overview (0.68)
- Research Report > New Finding (0.46)
Audio Representation Learning by Distilling Video as Privileged Information
Hajavi, Amirhossein, Etemad, Ali
Deep audio representation learning using multi-modal audio-visual data often leads to a better performance compared to uni-modal approaches. However, in real-world scenarios both modalities are not always available at the time of inference, leading to performance degradation by models trained for multi-modal inference. In this work, we propose a novel approach for deep audio representation learning using audio-visual data when the video modality is absent at inference. For this purpose, we adopt teacher-student knowledge distillation under the framework of learning using privileged information (LUPI). While the previous methods proposed for LUPI use soft-labels generated by the teacher, in our proposed method we use embeddings learned by the teacher to train the student network. We integrate our method in two different settings: sequential data where the features are divided into multiple segments throughout time, and non-sequential data where the entire features are treated as one whole segment. In the non-sequential setting both the teacher and student networks are comprised of an encoder component and a task header. We use the embeddings produced by the encoder component of the teacher to train the encoder of the student, while the task header of the student is trained using ground-truth labels. In the sequential setting, the networks have an additional aggregation component that is placed between the encoder and task header. We use two sets of embeddings produced by the encoder and aggregation component of the teacher to train the student. Similar to the non-sequential setting, the task header of the student network is trained using ground-truth labels. We test our framework on two different audio-visual tasks, namely speaker recognition and speech emotion recognition and show considerable improvements over sole audio-based recognition as well as prior works that use LUPI.
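The non-sequential training setup described above can be sketched as two separate loss terms: one pulling the student encoder's embedding toward the teacher's, and one training the student's task header on ground-truth labels. The abstract does not say which distance is used between embeddings, so mean squared error here is an assumption, and all function names are hypothetical.

```python
import math


def embedding_distillation_loss(student_embedding, teacher_embedding):
    """Mean squared error between student and teacher embeddings.

    MSE is an assumed choice; the abstract only states that teacher
    embeddings (rather than soft labels) supervise the student encoder.
    """
    diffs = [(s - t) ** 2 for s, t in zip(student_embedding, teacher_embedding)]
    return sum(diffs) / len(diffs)


def cross_entropy(logits, label):
    """Cross-entropy of one example, computed with a stable log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def student_losses(student_embedding, teacher_embedding, student_logits, label):
    """Return (encoder_loss, header_loss) for one training example.

    The encoder term uses the teacher's audio-visual embedding; the header
    term uses only the ground-truth label, as in the description above.
    """
    encoder_loss = embedding_distillation_loss(student_embedding, teacher_embedding)
    header_loss = cross_entropy(student_logits, label)
    return encoder_loss, header_loss


# Hypothetical values for a single audio clip with three classes.
encoder_loss, header_loss = student_losses(
    student_embedding=[0.2, -0.1, 0.4],
    teacher_embedding=[0.3, -0.1, 0.1],
    student_logits=[1.5, 0.2, -0.7],
    label=0,
)
```

In the sequential setting, a second embedding term for the aggregation component would be added in the same way.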
Document Image Binarization in JPEG Compressed Domain using Dual Discriminator Generative Adversarial Networks
Rajesh, Bulla, Agrawal, Manav Kamlesh, Bhuva, Milan, Kishore, Kisalaya, Javed, Mohammed
Image binarization techniques are widely used to enhance noisy and/or degraded images for different Document Image Analysis (DIA) applications like word spotting, document retrieval, and OCR. Most of the existing techniques focus on feeding pixel images into Convolutional Neural Networks to accomplish document binarization, which may not produce effective results when working with compressed images that need to be processed without full decompression. Therefore, in this research paper, the idea of document image binarization directly using the JPEG compressed stream of document images is proposed by employing Dual Discriminator Generative Adversarial Networks (DD-GANs). Here the two discriminator networks, Global and Local, work on different image ratios and use focal loss as generator loss. The proposed model has been thoroughly tested with different versions of the DIBCO dataset having challenges like holes, erased or smudged ink, dust, and misplaced fibres. The model proved to be highly robust and efficient in terms of both time and space complexity, and also achieved state-of-the-art performance in the JPEG compressed domain.
Saturated Models, Deviance and the Derivation of Sum of Squares
Before we discuss deviance, we first need to understand what the Saturated, Proposed and Null Models are. A Saturated Model is where the number of parameters/coefficients is equal to the number of data points. This is like a 'connect the dots' model where the line or curve passes through each point. This is considered to be the perfect model as it takes into account all the variance in the data and has the maximum achievable likelihood. A Null Model is the opposite with only one parameter, which is the intercept.
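As a rough illustration of how these three models relate, the sketch below computes deviance as twice the log-likelihood gap to the saturated model. It assumes Gaussian errors with a known variance of 1 and takes the proposed model to be a straight-line fit; both choices, along with the toy data and function names, are assumptions for the example. Under those assumptions the deviance equals the familiar sum of squared residuals.

```python
import math


def gaussian_log_likelihood(y, fitted, sigma=1.0):
    """Log-likelihood of observations y given fitted means (known sigma)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (yi - mi) ** 2 / (2 * sigma ** 2)
        for yi, mi in zip(y, fitted)
    )


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0, 5.0]

saturated_fit = list(y)                      # one parameter per data point
null_fit = [sum(y) / len(y)] * len(y)        # intercept only
intercept, slope = fit_line(x, y)
proposed_fit = [intercept + slope * xi for xi in x]

ll_saturated = gaussian_log_likelihood(y, saturated_fit)
residual_deviance = 2 * (ll_saturated - gaussian_log_likelihood(y, proposed_fit))
null_deviance = 2 * (ll_saturated - gaussian_log_likelihood(y, null_fit))

# With sigma = 1 the deviances reduce to sums of squares.
residual_sum_of_squares = sum((yi - mi) ** 2 for yi, mi in zip(y, proposed_fit))
total_sum_of_squares = sum((yi - mi) ** 2 for yi, mi in zip(y, null_fit))
```

The saturated model has zero residuals, so its log-likelihood is the largest of the three, and the null deviance is never smaller than the residual deviance of a model that includes an intercept.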
A generative, predictive model for menstrual cycle lengths that accounts for potential self-tracking artifacts in mobile health data
Li, Kathy, Urteaga, Iñigo, Shea, Amanda, Vitzthum, Virginia J., Wiggins, Chris H., Elhadad, Noémie
Mobile health (mHealth) apps such as menstrual trackers provide a rich source of self-tracked health observations that can be leveraged for health-relevant research. However, such data streams have questionable reliability since they hinge on user adherence to the app. Therefore, it is crucial for researchers to separate true behavior from self-tracking artifacts. By taking a machine learning approach to modeling self-tracked cycle lengths, we can both make more informed predictions and learn the underlying structure of the observed data. In this work, we propose and evaluate a hierarchical, generative model for predicting next cycle length based on previously-tracked cycle lengths that accounts explicitly for the possibility of users skipping tracking their period. Our model offers several advantages: 1) accounting explicitly for self-tracking artifacts yields better prediction accuracy as likelihood of skipping increases; 2) because it is a generative model, predictions can be updated online as a given cycle evolves, and we can gain interpretable insight into how these predictions change over time; and 3) its hierarchical nature enables modeling of an individual's cycle length history while incorporating population-level information. Our experiments using mHealth cycle length data encompassing over 186,000 menstruators with over 2 million natural menstrual cycles show that our method yields state-of-the-art performance against neural network-based and summary statistic-based baselines, while providing insights on disentangling menstrual patterns from self-tracking artifacts. This work can benefit users, mHealth app developers, and researchers in better understanding cycle patterns and user adherence.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Germany > Berlin (0.04)
- South America > Bolivia (0.04)
- Research Report > Experimental Study (0.67)
- Research Report > New Finding (0.46)
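For the menstrual cycle model above, the idea of accounting for skipped tracking can be sketched as follows: if a user skips logging s periods, the observed "cycle" is really s + 1 true cycles added together. The sketch assumes Poisson-distributed cycle lengths and a truncated geometric prior on the number of skips; these distributional choices, the parameter values, and the function names are assumptions for illustration, since the abstract does not specify them.

```python
import math


def log_poisson(count, rate):
    """Log of the Poisson probability mass at `count` with mean `rate`."""
    return -rate + count * math.log(rate) - math.lgamma(count + 1)


def skip_posterior(observed_length, mean_cycle, skip_prob, max_skips=5):
    """Posterior over the number of skipped cycles given one observed length.

    Prior: P(s) proportional to (1 - skip_prob) * skip_prob ** s, truncated
    at max_skips. Likelihood: the observed length is Poisson with mean
    mean_cycle * (s + 1), i.e. the sum of s + 1 true cycles.
    """
    log_weights = []
    for skips in range(max_skips + 1):
        log_prior = math.log(1 - skip_prob) + skips * math.log(skip_prob)
        log_like = log_poisson(observed_length, mean_cycle * (skips + 1))
        log_weights.append(log_prior + log_like)
    peak = max(log_weights)
    weights = [math.exp(v - peak) for v in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical user with a 30-day average cycle and a 20% chance of skipping.
typical = skip_posterior(observed_length=30, mean_cycle=30.0, skip_prob=0.2)
doubled = skip_posterior(observed_length=60, mean_cycle=30.0, skip_prob=0.2)
most_likely_skips_typical = typical.index(max(typical))
most_likely_skips_doubled = doubled.index(max(doubled))
```

A 60-day observation is far better explained by one missed log than by a single unusually long cycle, which is the kind of artifact the abstract says the model separates from true behavior.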